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Abstract 

We present a comprehensive new framework for handling biologicaUy 
accurate models of molecular evolution. This model provides a system- 
atic framework for studying models of molecular evolution that implement 
heterogeneous rates, conservation of reading frame, differing rates of inser- 
tion and deletion, customizable parametrization of the probabilities and 
types of substitutions, insertions, and deletions, as well as neighboring 
dependencies. We have stated the model in terms of an infinite state 
Markov chain in order to maximize the number of applicable theorems 
useful in the analysis of the model. We use such theorems to develop 
an alignment-free parameter estimation technique. This alignment-free 
technique circumvents many of the nuanced issues related to alignment- 
dependent estimation. We then apply an implementation of our model to 
reproduce (in a completely alignment-free fashion) some observed results 
of Zhang and Gerstein [29] regarding indel length distribution in human 
ribosomal protein pseudogenes. 



1 Introduction 

While substitution models of molecular evolution have been well studies and 
developed, the inclusion of insertions and deletions (indels) into biologically 
accurate models has enjoyed less success. As remarked in [5], a robust and 
comprehensive understanding of probabilistic indel analysis (and its relation to 
sequence alignment) is still lacking. A number of models of molecular evolution 
that include substitutions, insertions and deletions have been proposed ([3], [23] . 
[22] J [21], I2S]), but what is missing is a comprehensive mathematical structure 
in which can be cast the problem developing of biologically accurate models 
of molecular evolution. In fact, it is this lack of a well-studied mathematical 
structure that leads to the analytic intractability of some proposed indel models 
(as mentioned in [22]). This lack of a unifying structure not only gives rise to 
computational entities with no biological counterpart such as "fragments" ([26]. 
[3S]) and the embedding of a given sequence into an infinite sequence ([Hj), but 
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also leads to difficulties in comparing models, their assumptions, and their ap- 
plicability. For example, the relationships between various substitution models 
of molecular evolution are well understood and these relationships can be easily 
compared and contrasted (as in [35] and [H]). This is due to most traditional 
substitution models being stated in terms of instantaneous rate matrices of a 
finite state Markov chain. Due to the variety of mathematical tools that have 
been used to implement indels in various applications (for example: HMM's [3], 
rate grammars transducers [13], birth-death processes [55] [IS]) no such 
immediate comparison of models is possible. 

A few structures have been suggested, most notably the framework of a finite 
state transducer ([2]). While finite state transducers indeed seem promising, we 
take a slightly different route that comes from the direction of dynamical systems 
(and so will allow for the application of many well-known dynamical systems' 
tools). 

We present in this paper a comprehensive mathematical theory in which one 
can explore biologically accurate models of molecular evolution. We also present 
an alignment-free method of estimating parameters. 

Our model allows for the incorporation of substitutions, insertions and dele- 
tions of any length less than or equal to a specified length N. We also incor- 
porate rate heterogeneity, and neighboring dependencies (context dependency) 
up to a specified distance. The model is discrete time (it can be viewed as a 
generalization of a stochastic grammar) and the assumptions are clearly stated: 
First, we assume that in one time unit, (possibly different) segments of DNA 
will not be both deleted and inserted. In one step, only insertions or deletions 
are allowed, not both. Second, we assume reversibility, though as we will see, 
this assumption can easily be relaxed. These are the only inherent assumptions 
in this model, but since our model contains a high degree of fiexibility further 
assumptions can be made if, for example, one insists on using, say, the HKY 
(|9J) model as the underlying substitution model. 

The mathematical language in which we cast this model is that of a discrete, 
time-homogeneous Markov chain on infinitely (countably) many states. This 
framework Also allows for the possibility of recasting the model in terms of 
a time- inhomogeneous Markov chain so as to allow for the evolution of the 
rate of evolution, but in this paper we remain in the time-homogeneous case 
for notational simplicity. No exotic constructs (like fragments, immortal or 
normal links, etc.) are used. This will maximize the number of probabilistic and 
mathematical tools that can be used to analyze this model. In the literature, this 
kind of mathematical environment is referred to as a "Random Substitution" |15) 
as mathematicians refer to a insertions, deletions, and substitutions collectively 
as "substitutions." 

Being that this model is cast in the robust, well studied environment of 
countable state Markov chains, exact, closed form solutions to many statistical 
questions become possible. Thus we can in some cases circumvent the nuanced 
issues introduced with estimation procedures (like the issue of attaining a local 
maximum versus global maximum for Hidden Markov Model MLE parameter 
estimation [5], and other such issues mentioned in j23 j .|22 j ). 
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2 Definition of the Model 



We present now the rigorous definition of the model. We proceeded in two 
steps: First wc define an infinite state Markov chain to incorporate insertions 
and substitutions. Second we construct the induced reversible Markov chain 
which will incorporate deletions. To construct the first Markov chain, we need: 

1. A = {ai, . . ., at} a finite set of ordered symbols referred to as an alphabet. 

2. For each letter a G A, (fia, -Pa) a finite (non-empty) probability space. 

3. For each letter a G A, a function ga : fla ^ A* with the property that if 
|fi'a('*^)| > Ij then the word ga{i^) begins with a. 

Let A" denote all words of length n formed from letters of A. Then A* = 

Un>i./l" is the set of finite length words formed from A. The alphabet A is 
usually equal to either {A, C, T, G} for DNA models, or {R, H,K,...,V} for 
amino acid models. The probability spaces {fla,Pa) encapsulate the probabili- 
ties of insertion and substitution. In particular, the cardinality of fla gives the 
number of different substitution, insertion, and deletion types that are allowed 
to occur at the letter a G A. The functions : — )• A* (and particularly, 
the ranges of functions ga) specify the set of allowable substitutions and inser- 
tions. In particular, if one wishes to allow the substitution of the letter b G A 
to occur at the letter a €A, then the function ga should evaluate to b on some 
Pa-non-zcro element uji of fla'- (7a(<^i) = b. If one wishes to allow the inser- 
tion of the n-length word vi . . .Vn G A^ to occur after the letter a G A, then 
the function ga should evaluate to avi . . .Vn on some Pa-non-zero element iJ2 
of fla- 30(^2) = avi . . .Vn- Notice that the word vi . . .Vn is preceded by a in 
50(^2) = avi ■ ■ - Vn, this is to assure that vi . . .Vn has been genuinely inserted 
into the sequence. Lack of the initial a would cause the net effect of a being 
deleted, then vi . . . u„ being inserted into the created gap. 

We now define the Markov chain representing random substitutions and 
insertions. 

Definition 2.1 (Random Substitution-Insertion) . A random substitution-insertion 

(RSI) (with fixed A, {{^a, Pa)}aeA, and {ga}aeA) is an infinite state Markov 
chain {A* , P) with transition operator P defined in the following way. For 
u = b\ . . .bn G A* a word, we let 0„ = Oftj x • • ■ x fib^ and P„ = Pb^ x • • • x P^^ . 
We define gu ^ A* via concatenation of words: foruj = (wi, . . . , w„) € fi„, 
9u{oj)= 5bi(wi) . ..gb^{uin). Now define P by 



So one can think of the given model in the following way: instead of modeling 
the evolution of individual nucleotides (or fragments) the Markov transition 
operator P operates on entire sequences. Indeed, the state space of this model 
is A*: the set of all sequences. The transition operator P{u,v) (probability of 





(1) 
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transition from the sequence u to the sequence v) takes into consideration every 
insertion and substitution possible in one time unit to compute the appropriate 
probability. 

We now construct the induced reversible A^Iarkov chain, which will serve to 
incorporate deletions. Recall first the definition of a reversible Markov chain: 

Definition 2.2 (Reversible Markov chain). A Markov chain {X,P) with state 
space X and transition operator P, is said to he reversible if there exists a 
function m : X — >• (0, oo) such that for all x,y G X, 

m{x)P{x,y) = m{y)P{y,x) 

We now define the Markov chain which will serve as our comprehensive 
model of molecular evolution. 

Definition 2.3 (Reversible Random Substitution- Insertion-Deletion), we de- 
fine the random substitution, insertion, deletion (RSID) model {A* ,R) to be the 
Markov chain with the transition matrix given by 



_ P{x,y) + P{y,x) 

This Markov chain is reversible with m{x) given by m{x) = l+X^zeyi* 
Since any random substitution has finite range \{y : P{x,y) > 0}| < oo, the 
above is well defined. Note that as mentioned in the introduction, we can 
completely circumvent the reversibility criterion (and simultaneously allow for 
different rates for insertions and deletions) by modifying the above definition in 
the following way. If we wish insertions to have a rate Wi and deletions to have 
a rate of tt^ (one can easily make these rates depend on time, or location in a 
given sequence), then we can use the RSID model {A, R) with R given by 

_ 7r,P(x,?y)+^,P(?y,.T) 

Summarizing, (A* , R) is an infinite state, discrete-time Markov chain, with 
the state space representing entire DNA or amino acid sequences. We have 
implemented this RSID model as a discrete-time Markov since we wish to model 
mutations that occur due to inherent DNA replication infidelities, not mutations 
due to environmental factors. Such mutations can be accurately modeled in a 
discrete time fashion. The transition probability R{u, v) between two sequences 
u and V takes into account every possible substitution that could have happened 
when evolving the sequence u into v in one evolutionary step (the time unit 
can be taken to be a single replication). The transition probability R{u,v) 
also takes into accoimt all possible substitution and insertion paths leading 
from u to V, as well as all possible substitution and deletion paths leading 
from u to V. Again, the first assumption is that both insertions and deletions 
do not simultaneously happen in one step. Rather, to allow insertions and 
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deletions, one needs to consider the n-th step transition matrix: R'-'^\u,v) 
which is the {u, entry in the n}^ matrix product of R with itself. This n- 
step transition probability represents the probability (summed over all possible 
paths) of evolving the sequence u into the sequence u in n time units. So for the 
purpose of measuring the total evolutionary distance from u to u, one considers 
the so called Green's function: 

oo 

G(7i,i;) = ^i?(")K«) (4) 

The Green's function represents the total measure of ever evolving the sequence 
u into the sequence v. The function G(-, •) can then be used as a measure of 
evolutionary distance between sequences (with the obvious applications to say, 
phylogenetics) . 

3 Example 

We present here a toy example to elucidate the above definitions. We also 
present notation that succinctly summarizes given model. In this example, say 
we wish to have a mathematical object that represents a model where substi- 
tutions are described by Kimura's two parameter model |14| with transition 
probability equal to 0.2 and transversion probability equal to 0.1. Thus the 
instantaneous rate matrix for substitutions has the form: 





A 


C 


T 


G 


A 


' 0.6 


0.1 


0.1 


0.2 


C 


0.1 


0.6 


0.2 


0.1 


T 


0.1 


0.2 


0.6 


0.1 


G 


0.2 


0.1 


0.1 


0.6 



Say we also desired the model to allow only one letter insertions or dele- 
tions, with the probability of an indel occurring being 0.01, and the choice of 
the particular indel type being given by the uniform distribution. Thus, tran- 
sitions occur with probability 0.2 * (1 — 0.01) — 0.198, transversions occur with 
probability 0.1 * (1 — 0.01) = 0.099 and insertions and deletions both occur with 
probability 0.01/4 = 0.0025. 

Using the notation introduced in section [51 the alphabet is given by yi = 
{A, G, T, G}. Each of the probability spaces Via consist of eight elements. We 
provide the details regarding VI a, Pa, and gA since the other definitions are 
completely analogous. Now as stated VIa = {1,2,3,4,5,6,7,8} since there are 
eight allowable events that can happen at the base A: substitution to one of 
four other bases, plus four possible indels. Also, Pa and gA are defined by 
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Pa 


9A 


1 


0.594 


A 


2 


0.099 


C 


3 


0.099 


T 


4 


0.198 


G 


5 


0.0025 


AA 


6 


0.0025 


AC 


7 


0.0025 


AT 


8 


0.0025 


AG 



So, for example, we have that Pa(1) = 0.594 and 5a(1) = A. Note that the last 
four rows of the above table represent the insertion or deletion of A, C, T, or 
G respectively. To succinctly represent the probability spaces and functions for 
this particular example, we use the following notation: let E represent the n-th 
coordinate process random variable corresponding to the Markov chain given 
in definition 12.31 whose parameter values were determined above. Then we can 
represent E utilizing the notation given in table [1] 

4 Flexibility of the Model 

In this section we present the flexibility of the RSID model given in definition 
[2:21 

Traditional Substitution Models of Molecular Evolution The RSID 

model allows for the implementation of most previous substitution models of 
molecular evolution. For example, by using the alphabet A = {A,C,T,G}, 
and for each a e A, letting Qa = {1,2,3,4}, and ga{^a) — {A,C,T,G}, and 
choosing the probabilities Pa appropriately, the RSID model given in definition 
[^completely encompasses the JC [H], HKY [9], FEL81 [6^, K2P TIJ, and REV 
[24] models. In fact, in this particular case the RSID model is a generalization of 
all possible homogeneous rate Markov models of DNA or amino acid evolution. 
This is due to the fact that if ga(ria) = A for each a G A, then the RSID model 
simply becomes a traditional finite state Markov chain (with as many states as 
the cardinality of the alphabet A). 

Heterogeneous rates Models utilizing heterogeneous rates of evolution can 
be introduced by slightly modifying definition 12 . 1 1 and consequently 12 . 21 Instead 
of fixed probability spaces (fla,Pa)i we allow the probability space to change. 
Let 3'((7a(0^)) denote the set of probability measures on ga{^a)- Then the 
desired heterogeneity can be introduced with a random probability (also known 
as a random element in the literature [1]) i.e. a probability- valued random 
variable: Xa : {flu,Pu) J'(<7a(f^a)). Then definitions 12.11 and 12.31 work just 
as before, but instead by utilizing the spaces (Oa,Xa). Hence, the RSID model 
can also incorporate heterogeneous evolution rates. This is similar in spirit to 
the idea of the "variety of fragments" utilized in i25) . 
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Table 1: Notation describing the Markov chain example from section [3] 



A with probability 0.594 
C with probability 0.099 
T with probability 0.099 
G with probability 0.198 
AA with probability 0.0025 
AC with probability 0.0025 
AT with probability 0.0025 
AG with probability 0.0025 



S : <^ 



A with probability 0.099 
C with probability 0.594 
T with probability 0.198 
G with probability 0.099 
CA with probability 0.0025 
CC with probability 0.0025 
CT with probability 0.0025 
CG with probability 0.0025 

A with probability 0.099 
C with probability 0.198 
T with probability 0.594 
G with probability 0.099 
TA with probability 0.0025 
TC with probability 0.0025 
TT with probability 0.0025 
TG with probability 0.0025 



A with probability 0.198 
G with probability 0.099 
T with probability 0.099 
G with probability 0.594 
GA with probability 0.0025 
GC with probability 0.0025 
GT with probability 0.0025 
GG with probability 0.0025 



Neighboring Dependencies We can introduce neighboring dependencies by 
again slightly modifying definition 12.11 For u — bi . . .bn G A* , instead of using 
the probability P„ — P^^ x • • • x Pi,^ , we can use a different probability Pu 
whose marginal distributions correspond to the Pf,^ , . . . , Pt^ ■ This is simply a 
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utilization of a coupling. A coupling between two probability spaces (fii, Pi) and 
{n,2, P2) is a probability C on the space fii x whose marginal distributions are 
Pi and P2. Hence the original definition l2.11 which assumes that what happens 
at a specific nucleotide (be it substitution, insertion, or deletion) is independent 
of its neighbors, simply uses the null coupling. Of course, the specific coupling 
to be used depends on the situation at hand, we are simply elucidating the 
various mathematical constructs that may be used to implement the desired 
biological properties. 

Parameterization Now the RSID model is not meant to be implemented in 
its most general form, but rather parameterized to a certain degree taking into 
consideration the problem at hand. For example, we sketch here a possibility 
of parameterizing the indel appearance rate to depend only on a two parameter 
a,l3 power law aL~^ on the length L of the particular indel. To accomplish 
this, all one needs to do is define the Pa in the following way: for a; £ fia, let 
Pa(w) = a\ga{io)\~^ . Such an RSID would then be able to accurately model, for 
example, the human ribosomal protein pseudogene evolution as studies in |29j . 
Further nuanced parameterizations are possible and easily implemented into 
the RSID model. For example, one can easily use a distribution on the possible 
indels that not only takes length into consideration, but also GC content. 

Implementable Biological Phenomena Due to the flexibility of this model, 
we summarize here the various biological phenomena that can be implemented 
by using the RSID model and its variants mentioned above. The RSID model 
provides a systematic framework for studying models of molecular evolution that 
implement heterogeneous rates, conservation of reading frame (through careful 
selection of the functions ga), variation in conservation, differing rates of inser- 
tion and deletion, customizable parameterization of the probabilities and types 
of substitutions, insertions, and deletions available (through the specification of 
the probabilities (fia, Pa)), as well as neighboring dependencies. 

5 Algorithms and Implementation 

The aim of this paper is not to present a preprogrammed set of algorithms, 
but rather a comprehensive framework to be adapted to the specific situation 
at hand. So instead of enumerating a plethora of algorithms that can be used 
in a variety of implementations, we rather present a general method for pro- 
ducing efficient algorithms useful for computation. We later will also show how 
setting the model on a firm mathematical footing allows for the possibility for 
general theorems to be applied. This will present new avenues for algorithmic 
implementations. 

One of the computational difficulties is in the calculation of the transition 
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probabilities (see equation [T|). Rewriting equation [3 we obtain 



(5) 



Now the above equation [5] is an example of the so called sum-product algo- 
rithm: since the underlying factor graph is a tree, the results of Kschischange 
et al. |16) can be applied to develop an algorithm that calculates this sum in 
linear time. Hence the full RSID transition probability in equation [5] can also 
be calculated in linear time. Dynamical programming methods can also be used 
at this step. 

Now if one wishes to measure the total (evolutionary) distance between 
the sequences u and v by using the Green's function G(w, v) found in equa- 
tion HI it is typically intractable to attempt to compute the entire infinite sum 
Xli^i but in |15] . it is proved that the approximation 



converges to the full summation G{u, v) exponentially quickly. So in practice one 
needs only calculate Gn{u, v) for some adequately large value of n (representing 
one to n generations of mutations from u to v). Then [15j provides error esti- 
mates for this approximation. Now G'„(-, •) is an infinite matrix, but the entire 
basis oi A* need not be used. In fact, at worst, in the computation of G„(u,w) 
one needs only the basis elements U™^!* for m — \u\ (maxaeyiwen^ |ga(w)|)". In 
practice, this large of a basis is not needed and one can utilize far fewer ele- 
ments. Since DNA sequences are not free to explore the entire space A* one can 
use, for example, GC-content or entropy requirements to eliminate extraneous 
basis elements. After obtaining an appropriate basis, the matrix sum G„(u,w) 
can be computed using, for example, the O(?ilogn) algorithm found in [10 and 



Again, since this article is meant to introduce an analytic environment in 
which to study biologically accurate models of molecular evolution, we do not 
provide explicit algorithms, but have rather indicated that such algorithms can 
be developed and tailored to the specific problem at hand. Being that the tran- 
sition probabilities can be calculated in linear time, and then Green's function 
in loglinear time, the main computational issue is determining an appropriate 
basis of "biologically viable" sequences for the computation of G„. This ba- 
sis determination can be accomplished by a combination of various techniques 
including GC-content and entropy methods. 

Alternative Algorithms We now demonstrate the advantage of having a 
well-developed mathematical background underlying the RSID model by pre- 
senting an alternative method of implementing the calculation of the Green's 
function G{u,v). Utilizing a slight modification of the proof of the spectral 
theorem for self-adjoint compact operators ([?, Theorem 1, pg. 895), it can be 



n 




(6) 
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shown that R is diagonahzable. Utihzing this diagonahzabihty, on can compute 
and hence G in a straightforward theoretical fashion. This provides an- 
other avenue for investigating the possible algorithmic implementations of the 
RSID model. 

6 Parameter Estimation 

Once a particular parameterization choice has been made, the problem of pa- 
rameter estimation can be approached in a number of ways. While parameters 
can be estimated using MLE methods via counting indel and substitution types 
and frequencies over a number of optimal and sub-optimal alignments (or all 
alignments), we present here a new alignment free method of parameter estima- 
tion. The method is based on frequencies of occurrence of subwords and follows 
the exposition of [TS]. The main argument is based on the fact that while one 
may assume the independence of substitution and indel events that occur at 
neighboring nucleotides, this does not in general imply the independence of the 
frequency of occurrence of neighboring nucleotides (or n-mers of nucleotides) 
due to the inclusion of indels. This fact can be taken advantage of to estimate 
parameters. Briefly, the frequencies can be calculated explicitly (as in |15j). and 
then these are used to assist in a maximum likelihood estimation. We begin by 
sketching how one can calculate the expected frequency of n-mers. 

First we need a vector of occurrences: For w E A* , ai £ A, let Fi{w) = 
{\w\ai, ■ ■ ■ ,\w\a^) denote the column vector of occurrences of single letters in 
the sequence w. Let also Em denote the n*^ coordinate process associated to 
{R,A*) where letters are taken m at a time with overlaps (detail in [TJ]. We 
also define the matrices 

Definition 6.1 (Substitution Matrices). For a given RSID Markov chain (i?. A*) 
and integers m, n, define the m-th substitution matrix entry wise as follows: for 
ij eA*, 

(a4:)) =E,ii](r)|, (7) 

Here \w\i means the number of occurrences (with overlap) of i in u;, and the 
Ej denotes expectation taken with unit mass on the Markov state corresponding 

to j. Let A denote the dominant eigenvalue of The following was proved 

in [H] 

Lemma 6.1. For an RSID Markov chain (R,A*), the sequence of vectors 
A"^"''*^ converges almost surely and exponentially quickly to the strictly posi- 
tive eigenvector of corresponding to A with this eigenvector being indepen- 
dent of w G A* . 

Normalizing the strictly positive eigenvector of M^'^ corresponding to A 
gives the expected frequency of appearance of the letter A. This implies that 
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the observation frequencies converge exponentially quickly to the expected fre- 
quencies. Similar theorems in |15j are proved for arbitrary n-mers of letters 
from A. These frequencies are given in terms of the various parameters that 
were chosen for the probabilities Pa in the definition of 12.11 For single letters 
frequencies in the DNA case, this results in four constraints (linear in the en- 
tires of M^,^-*), and hence can estimate up to four parameters. Counting couplets 
of nucleotides gives 16 quadratic constraints; in general, counting n-mers gives 
4" equations of degree n in the entires of Setting these equations equal 

to the observed frequencies, one can then use standard nonlinear optimization 
techniques. 

For example, if one was given a data set and wished to model it using an 
RSID that used the Kimura two parameter 7,5 model ([H]) to describe the 
substitutions and a two parameter power law {Pa{uj) = a\ga{uj)\~^) to describe 
the indel distribution, one would only need to count single letter frequencies 
to obtain four equations in the four parameters a, /3, 7, S. The maximal likeli- 
hood parameter estimator can then be found using any preferred optimization 
technique. 

It is important to note that this method of parameter estimation is com- 
pletely alignment free. This circumvents the myriad issues involved when, for 
example, estimating parameters in a classical substitution model of molecular 
evolution: choosing a particular alignment algorithm, a particular alignment 
parameterization (linear, log-linear, affine), particular mismatch, match, gap 
opening, and gap extensions penalties. It is hard to overstate the advantages of 
having an alignment-free parameterization technique. This is mainly due to the 
fact that choosing a particular alignment scheme is a nuanced endeavor where 
slight changes in implementation can lead to large changes in alignment out- 
come. Furthermore, it has been observed that various algorithms have potential 
to introduce hidden bias (see [19], [18], [20], [21], [8], [27], [7]). 

7 Application to Human Ribosomal Protein Pseu- 
dogenes 

We applied a simple implementation of the RSID model to the data set used in 
[29] to measure the indel length distribution in human ribosomal protein pseu- 
dogenes. Note that we wish to measure the underlying intrinsic indel length 
distribution in the evolution of human ribosomal protein pseudogenes, not (as 
|29| did) to estimate this distribution via comparison with a different species. 
We used the Kimura two-parameter model ([14j) for the substitutions and a two 
parameter power law {Pa{uj) — a\ga{uj)\~^) to describe the indel distribution. 
We also assumed reversibility and time homogeneity. Using the parameter es- 
timation technique detailed above, we obtained a power law of indel length L 
distribution of .4955L^^-^'^'*°. Compare this to the result of [H] which gave a 
deletion distribution of ASL^^'^^ and insertion distribution of .53L~^-^'^. The 
discrepancy is most likely due to the fact that we considered insertions and 
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deletions together as well as assumed reversibility, whereas the results of [22] 
clearly indicate different rates for insertions and deletions. However, the basic 
power-law similarity indicates the feasibility of our alignment free parameter 
estimation technique. 

8 Conclusion 

We have presented a comprehensive new framework for handling biologically 
accurate models of molecular evolution. As we have demonstrated, the num- 
ber of implementable biological phenomenon is vast. One profound advantage 
of stating the RSID in the language of an infinite state Markov chain is that 
one can utilize the vast mathematical literature to rigorously analyze a given 
implementation. We used such theorems to develop an alignment-free param- 
eter estimation technique. This alignment-free parameter estimation technique 
circumvents many nuanced issues related to alignment-dependent estimation. 

We then applied one possible implementation of the RSID model to ana- 
lyze the power-law governing the indel distribution in human ribosomal protein 
pseudogene sequences [29]. Our analysis corroborated the observations of [29] . 
It is of great interest to note that when it comes to indel length distribution, 
our RSID model gave similar power law parameters as [29j without needing 
to consider the complicated alignments described in the "Sequence Alignment" 
section of [55]. 
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